[Quantization] Consolidate experts_int8 with fp8 online quantization#38463

Open
Josephasafg wants to merge 8 commits into vllm-project:main from Josephasafg:experts_int8_consolidation

Conversation

Contributor

@Josephasafg commented Mar 29, 2026

Purpose

Following up on #38032, this PR consolidates experts_int8 with fp8's online quantization infrastructure (QeRL). It extracts shared online MoE quantization logic into a common base class and refactors fp8's MoE kernel infrastructure into a reusable mixin.
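The base-class-plus-mixin shape described above can be sketched as follows. This is a minimal illustration of the pattern, not vLLM's actual API: the class names `OnlineMoEMethodBase`, `Fp8MoEKernelMixin`, `ExpertsInt8MoEMethod`, and `Fp8OnlineMoEMethod` come from the PR description, but every method body here is a hypothetical stand-in for the real weight-allocation and quantization logic.

```python
class OnlineMoEMethodBase:
    """Shared online-quantization flow: allocate placeholder weights,
    then quantize after the checkpoint has been loaded (sketch)."""

    def create_weights(self, layer):
        # The real code allocates weights on the meta device; this
        # sketch only records that loading has not happened yet.
        layer.weights_loaded = False

    def process_weights_after_loading(self, layer):
        self._quantize(layer)  # deferred, subclass-specific quantization
        layer.weights_loaded = True

    def _quantize(self, layer):
        raise NotImplementedError


class Fp8MoEKernelMixin:
    """Reusable fp8 MoE kernel selection (hypothetical stub)."""

    def select_kernel(self):
        return "fp8_moe_kernel"


class ExpertsInt8MoEMethod(OnlineMoEMethodBase):
    def _quantize(self, layer):
        layer.quant_dtype = "int8"


class Fp8OnlineMoEMethod(Fp8MoEKernelMixin, OnlineMoEMethodBase):
    def _quantize(self, layer):
        layer.quant_dtype = "fp8"
```

The point of the split is that both quantization methods inherit the same load-then-quantize lifecycle from the base class, while only the fp8 path pulls in the kernel mixin.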

Test Plan

The existing experts_int8 and fp8 tests should pass.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test commands.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Josephasafg <ajgard7@gmail.com>

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the online MoE quantization infrastructure by introducing a common base class, OnlineMoEMethodBase, and a mixin, Fp8MoEKernelMixin, to share logic between different quantization methods. It migrates ExpertsInt8MoEMethod and Fp8OnlineMoEMethod to this new architecture, which utilizes meta-device weight allocation and deferred quantization after model loading. Review feedback identifies potential division-by-zero issues in the int8 quantization loop when encountering zero-valued weight rows and highlights inefficient cross-device tensor allocations for scale parameters that should be created on the same device as the weights.
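The division-by-zero issue flagged in the review arises in symmetric per-row int8 quantization: the scale for a row is its absolute maximum divided by 127, so an all-zero weight row yields a zero scale, and dividing by it during quantization fails. A minimal torch-free sketch of the guard (the function name and plain-list representation are illustrative, not vLLM's implementation):

```python
def quantize_rows_int8(weight):
    """Symmetric per-row int8 quantization over a 2-D list of floats.

    Returns (quantized_rows, per_row_scales). An all-zero row would
    produce scale 0 and a ZeroDivisionError below, so it is clamped
    to a scale of 1.0, which maps the row to all-zero int8 values.
    """
    quantized, scales = [], []
    for row in weight:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0.0 else 1.0  # zero-row guard
        quantized.append([round(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales
```

The review's second point, about scale parameters being allocated on a different device than the weights, does not show up in this list-based sketch; in tensor code the fix is simply to create the scale tensor with the weight tensor's device rather than the default one.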

Signed-off-by: Josephasafg <ajgard7@gmail.com>

mergify bot commented Mar 29, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Josephasafg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 29, 2026
@mergify mergify bot removed the needs-rebase label Mar 30, 2026
@Josephasafg Josephasafg marked this pull request as ready for review March 30, 2026 07:33

@claude bot left a comment


Claude Code Review

This pull request is from a fork, so automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.
